ABSTRACT

Treating the information retrieval task as one of classification
has been shown to be the most effective way to achieve
high performance on a particular task. In this paper, we
describe a hybrid human-computer system that addresses
the problem of achieving high performance on IR tasks by
systematically and replicably creating large numbers of
document assessments. We demonstrate how User Modeling,
Document Assessment and Measurement combine to provide
a shared understanding of relevance, a means for representing
that understanding to an automated system, and
a mechanism for iterating and correcting such a system so
as to converge on a desired result.

1.  INTRODUCTION 

The  extraordinary  effectiveness  of  the  Relevance  Feedback 
(RF)  paradigm  is  well  established.  Recent  work  [19]  treating 
the  information  retrieval  task  as  a  form  of  classification  has 
demonstrated  that  the  most  effective  way  to  achieve  high 
performance  on  a  particular  task  is  to  acquire  a  large  num¬ 
ber  of  document  assessments.  How  these  assessments  are 
acquired,  however,  is  often  left  unspecified:  within  evalua¬ 
tions,  such  as  the  TREC  series  of  conferences,  assessments 
performed for a particular task one year are reused for Relevance
Feedback the next. In real-world, time-synchronous
tasks,  we  cannot  wait  for  assessments  before  addressing  the 
task:  such  assessments,  if  they  are  to  be  used,  must  be  cre¬ 
ated  while  addressing  the  task.  In  this  paper,  we  describe  a 
hybrid  human-computer  system  that  addresses  the  problem 
of  achieving  high  performance  on  IR  tasks  by  systematically 
and  replicably  creating  large  numbers  of  document  assess¬ 
ments. 

The impact of large numbers of document assessments has
been  indirectly  tested  in  previous  TREC  tasks,  including 
those  within  the  Legal  Track  [18].  In  several  cases,  TREC 
tasks  have  been  created  to  test  the  capabilities  of  Relevance 
Feedback  systems.  Testing  such  systems,  however,  imposes 
a  fundamental  challenge  to  the  organizers  of  such  a  task: 
(non-pseudo)  relevance  feedback  presumes  the  existence  of 
feedback  judgments  by  a  user  who  is  knowledgeable  about 
the topic. Generating such assessments, however, is a potentially
expensive proposition, and acquiring a sufficient
quantity  of  assessments  to  test  the  asymptotic  properties 
of  the  tested  systems  is  even  more  so.  A  simple  accommo¬ 
dation  is  therefore  applied,  wherein  assessments  produced 
for  a  topic  during  evaluation  of  the  ad-hoc  task  in  previous 
years  are  reused  to  stand  in  for  actual  relevance  assessments 
within  the  RF  task  in  subsequent  years.  Approaching  the 
development  of  training  data  in  this  manner  has  the  effect 
of  easily  affording  the  creation  of  large  amounts  of  relevance 
data  for  the  RF  task. 

The  reuse  of  evaluation  assessments  in  the  RF  task  also  en¬ 
ables  us  to  perform  a  kind  of  gedankenexperiment  to  assess 
the  effect  of  various  sources  of  information  in  the  IR  task. 
In  both  the  original  ad-hoc  task,  conducted  the  first  year, 
and  the  relevance  feedback  task,  conducted  in  subsequent 
years,  the  topic  is  the  same,  allowing  comparison  of  results. 
In  some  cases,  results  have  improved  substantially  between 
the  original  run  of  the  topic  and  subsequent  runs.  We  must 
therefore  examine  what  has  changed  between  the  two  runs 
in order to afford improved results. It is possible that additional
understanding of the topic by the experimenters enabled
better system design, but the prevailing focus on
general-purpose designs suggests that this is not the case.
It is also possible that new or improved algorithms became
available in the intervening period and that these algorithms
produced better results. However, the RF results were produced
using algorithms that have been known for some time, such as
SVM, which suggests that algorithmic improvements are not
responsible for the improvement either. After eliminating
these other possibilities, we conclude that the most obvious
difference between the runs is also the one most responsible
for the exhibited improvements: namely, the additional
information available in the form of document assessments.

The  performance  of  information  retrieval  systems  is  there¬ 
fore  seen  to  be  a  function  not  only  of  the  inherent  properties 
of  the  system,  such  as  the  algorithms  used,  but  also  of  the 
information  available  as  input  to  the  system.  Indeed,  the 
nature of the input information, including specifically the
quality and quantity of such information, is a critical
determinant of performance. That additional information can
bring  improved  results  has  been  recognized  within  the  eval¬ 
uation  community  for  some  time,  as  expressed  through  the 
existence  of  evaluation  tasks  such  as  the  Interactive  [10]  and 
HARD [1] tracks. Such evaluations have sought to bring
additional  information  to  the  information  retrieval  task  in  a 
controlled  manner,  limiting  both  the  degree  and  the  manner 
of  information  transfer.  To  some  extent,  such  limitations 
were  driven  by  the  need  to  pose  a  controlled  experimental 
paradigm  wherein  observed  improvements  could  be  reason¬ 
ably  attributed  to  the  effect  of  the  additional  information. 
Some of the limitations must, however, be attributed to the
difficulty  of  making  available  the  resources  necessary  to  per¬ 
form  information  transfer  experiments  at  large  scale.  The 
results  of  such  experiments,  while  showing  that  additional 
information  does  indeed  help,  are  limited  by  the  fact  that 
the  nature  and  amount  of  information  was  limited  by  the 
experimental  conditions. 

For  the  first  time,  the  Legal  Interactive  task  admits  the  pos¬ 
sibility  of  experimenting  with  large  amounts  of  information 
as  input  to  the  IR  task.  As  stated  in  the  guidelines,  the  pur¬ 
pose  of  the  interactive  task  is  “. . .  to  enable  the  task  to  model 
more  completely  and  accurately  the  conditions  and  objec¬ 
tives  of  e-discovery  in  the  real  world”  [3].  One  such  property 
being  modeled  is  that  of  the  Lead  Attorney  as  user:  although 
document  review  is  typically  delegated  to  more  junior  at¬ 
torneys  or  out-sourced,  it  is  ultimately  the  Lead  Attorney 
whose  notion  of  relevance  must  be  considered.  Although  the 
amount  of  time  (10  hours)  allocated  within  the  Interactive 
Task  for  consultation  with  the  Topic  Authority  (TA)  is  less 
than  that  typically  experienced  in  document  reviews  of  this 
magnitude,  it  is  sufficient  to  explore  more  sophisticated  ap¬ 
proaches  to  interactive  IR  than  have  been  explored  in  TREC 
in  the  past.  In  particular,  because  the  interaction  is  not  lim¬ 
ited  to  a  single  exchange,  iterative  exploration  of  the  topic 
becomes  possible,  as  explained  in  our  analysis,  below. 

At the same time that the Interactive Task guidelines provide
for the possibility of incorporating larger amounts of
input information into the Information Retrieval task, they
also  impose  a  much  more  stringent  notion  of  relevance  than 
has  been  required  in  the  past.  While  relevance  was  in  prior 
years  based  on  the  consensus  of  the  reviewers,  and  therefore 
not completely defined until after the task had been completed,
this year’s Interactive guidelines require that the TA
come  to  a  fairly  complete  understanding  of  what  relevance 
means  for  a  particular  topic  prior  to  providing  guidance  to 
the  individual  teams.  Of  course  this  ideal  is  not  always  met: 
the  TA  may  change  his  or  her  mind  regarding  relevance,  and 
cannot  help  but  be  influenced  by  discussions  of  relevance 
and  exposure  to  particular  documents.  Interactive  systems, 
therefore,  must  take  into  account  not  only  the  possibility 
that  relevance  is  being  defined  external  to  a  particular  rep¬ 
resentation,  but  that  the  very  notion  of  what  is  relevant  may 
be  changing  over  time.  In  exchange  for  this  added  complica¬ 
tion,  however,  systems  are  provided  with  a  single  target  of 
relevance,  and  are  not  limited  by  the  amount  of  agreement 
that  can  be  achieved  by  uninformed  assessors. 

The  key  questions  to  be  answered  are  therefore  these:  How 
can  we  most  effectively  harness  the  knowledge  that  the  user 
makes  available  to  the  system  in  order  to  improve  perfor¬ 
mance?  Given  limitations  on  the  user’s  time  and  attention, 
what  is  the  best  way  to  structure  the  conversation  with  the 
user  so  as  to  acquire  the  most  information  with  the  least 
effort? Given a certain amount of information, how best to
go about the task of representing it in a way that is consumable
by an automatic system? And finally, how can such a
system deal with the real-world exigencies posed by operating
in such an environment, including a fallible user whose
interpretation is subject to change? These are the questions
with which this paper is concerned.

Figure 1: System Architecture

2.  SYSTEM  OVERVIEW 

Our  system  comprises  one  main  agent,  the  proxy,  and  four 
separate,  yet  interconnected,  processes:  User  Modeling,  As¬ 
sessment,  Classification  and  Measurement.  A  diagram  is 
shown  in  Figure  1. 

2.1  Proxy 

The  proxy  is  an  internal  agent  who  co-constructs  a  theory 
of  relevance  with  the  user  via  User  Modeling.  The  proxy 
provides  guidance  to  document  assessors  and  resolves  intra- 
and  inter-assessor  discrepancies  to  ensure  that  errors  are  re¬ 
solved  in  favor  of  the  proper  interpretation  of  relevance. 

2.2  User  Modeling 

User Modeling is the process by which the proxy co-determines
a theory of relevance with the user (in this case the
TA),  iterating  the  process  to  increase  the  likelihood  of  rele¬ 
vance  within  the  system’s  output. 

2.3  Assessment 

The  assessment  process  is  designed  to  (i)  generate  a  large 
amount  of  training  data  (ii)  of  the  appropriate  kind  (iii)  with 
minimal  error.  The  assessment  process  consists  of  an  initial 
assessment  of  all  documents  of  interest  and  subsequent  error 
correction  procedures. 

2.4  Classification 

Document-assessment  pairs  generated  during  assessment  are 
used  as  training  data  for  a  supervised  classification  system. 
The  classifier  is  trained  over  available  assessments  and  the 
resulting  model  used  to  perform  a  binary  classification  of  all 
documents. 
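
As a minimal sketch of this step, the example below trains a binary classifier from document-assessment pairs. It assumes a bag-of-words representation and a simple perceptron learner, neither of which is specified in this overview; the function names and the toy training texts are hypothetical.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase whitespace tokenization; a stand-in for real feature extraction."""
    return text.lower().split()

def train_perceptron(pairs, epochs=10):
    """Train on (document_text, assessment) pairs; assessment is 'R' or 'NR'."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, label in pairs:
            target = 1 if label == "R" else -1
            tokens = tokenize(text)
            score = bias + sum(weights[t] for t in tokens)
            if target * score <= 0:  # misclassified: move weights toward the target
                for t in tokens:
                    weights[t] += target
                bias += target
    return weights, bias

def classify(model, text):
    """Binary decision for one document: 'R' if the score is positive, else 'NR'."""
    weights, bias = model
    score = bias + sum(weights.get(t, 0.0) for t in tokenize(text))
    return "R" if score > 0 else "NR"

# Invented assessments, for illustration only.
training = [
    ("in-store display campaign for cigarettes", "R"),
    ("point of sale promotion at retail outlets", "R"),
    ("quarterly payroll and benefits summary", "NR"),
    ("laboratory equipment maintenance schedule", "NR"),
]
model = train_perceptron(training)
corpus = ["retail display promotion", "payroll schedule"]
labels = [classify(model, doc) for doc in corpus]
```

Any supervised learner could be substituted for the perceptron; the point is only that the assessments, not the learner, carry the definition of relevance.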

2.5  Measurement 

The  performance  of  the  classification  system  is  regularly 
evaluated  in  order  to  test  its  efficacy.  The  classification 
system  is  run  over  all  documents  in  the  corpus.  Follow¬ 
ing  classification,  a  random  sample  is  drawn  and  reviewed 
by  document  assessors.  Data  generated  by  the  evaluation 
process  are  used  to  tune  the  system  and  may  result  in  the 
proxy  and  user  modifying  the  theory  of  relevance. 
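
The sampling step can be sketched as follows, assuming simple random samples drawn separately from the documents classified R and those classified NR. The function name, the `review` callback (standing in for a human assessor) and the toy corpus are assumptions made for illustration.

```python
import random

def estimate_precision_recall(predicted, review, sample_size, seed=0):
    """Estimate precision and recall from assessor review of random samples.

    predicted: dict doc_id -> 'R' or 'NR' (classifier output over the corpus)
    review: callable doc_id -> 'R' or 'NR' (stands in for a human assessor)
    """
    rng = random.Random(seed)
    pos = sorted(d for d, lab in predicted.items() if lab == "R")
    neg = sorted(d for d, lab in predicted.items() if lab == "NR")
    if not pos or not neg:
        raise ValueError("need at least one predicted-R and one predicted-NR document")
    pos_sample = rng.sample(pos, min(sample_size, len(pos)))
    neg_sample = rng.sample(neg, min(sample_size, len(neg)))
    # Fraction of each sample that the assessor judges relevant.
    pos_rate = sum(review(d) == "R" for d in pos_sample) / len(pos_sample)
    neg_rate = sum(review(d) == "R" for d in neg_sample) / len(neg_sample)
    # Scale the sample rates up to the full predicted-R and predicted-NR sets.
    est_true_pos = pos_rate * len(pos)
    est_false_neg = neg_rate * len(neg)
    total_relevant = est_true_pos + est_false_neg
    recall = est_true_pos / total_relevant if total_relevant else 0.0
    return pos_rate, recall

# Toy corpus: documents 0-99 are truly relevant; the classifier finds 80 of
# them and also returns 20 non-relevant documents.
corpus_truth = {i: ("R" if i < 100 else "NR") for i in range(1000)}
predicted = {i: ("R" if i < 80 or 100 <= i < 120 else "NR") for i in range(1000)}
precision, recall = estimate_precision_recall(predicted, corpus_truth.get, sample_size=50)
```

Sampling the NR side matters: without it, relevant documents the classifier missed are never observed and recall cannot be estimated.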

3.  USER  MODELING 
3.1  Introduction 


The effectiveness of an IR system is measured by how well
it retrieves relevant text from a corpus.1 Relevance is a derived
property that entails a user and an information need:
a text is deemed relevant by a user if it satisfies that user’s
information need (cf. [16]). Thus, at some level of an IR
system,  there  must  exist  a  representation  of  a  user  and  his 
information  need  (User  Modeling).  Moreover,  User  Model¬ 
ing  (UM)  serves  as  a  powerful  source  of  input  by  providing 
a  mechanism  by  which  external  knowledge  can  be  formal¬ 
ized  into  the  system  via  query  development,  vocabularies, 
etc.  Indeed,  this  year’s  Interactive  Task  appears — in  part — 
predicated  on  this  aspect  of  IR  by  incorporating  into  its 
design  a  Topic  Authority  to  serve  as  a  knowledgeable  yet 
“needful”  user. 

UM  is  understood  as  a  two-fold  endeavor:  (i)  constructing 
a  definition  of  relevance  and  (ii)  iteratively  interacting  with 
a  user  to  increase  the  likelihood  of  relevance  in  the  output. 
We follow [17] in positing that mediated interaction, that
is, interaction among a user, a human intermediary and an IR
system,  is  the  most  effective  form  of  UM  in  IR.  Within  such 
a  model,  an  intermediary  is  an  “intelligent  agent  construct¬ 
ing,  implementing  and  modifying  user  models  in  all  their 
complexity  with  considerable  feedback”[17].2 

3.2  UM  as  co-construction 

There  are  two  central  tenets  of  our  approach  to  UM:  (i)  a 
user  is  seeking  to  resolve  an  “anomalous  state  of  knowledge” 
and  (ii)  the  user  is  unable  to  precisely  specify  what  informa¬ 
tion  is  needed  to  resolve  the  anomalous  knowledge-state  [4]. 
These  tenets  underlie  our  own  endeavors  as  intermediaries: 
we  are  seeking  to  resolve  an  anomalous  state  of  knowledge 
as  it  pertains  to  satisfying  the  user’s  information  need  and 
we  are  unable  to  precisely  define  what  information  will  sat¬ 
isfy  the  user’s  information  need.  Moreover,  we  recognize 
that  users  and  intermediaries  have  access  to  external  knowl¬ 
edge  sources  (personal  knowledge,  reference  guides,  the  tar¬ 
get  corpus,  etc.)  that  can  be  leveraged  to  inform  and  refine 
the  model.  Thus,  the  act  of  UM  is  a  co-construction  of 
information  needs  and  mutual  knowledge3  in  a  shared  rep¬ 
resentation. 

We  assume  a  model,  based  on  [6]  and  depicted  in  Figure  2, 
in  which  the  representation  serves  as  the  common  ground 
through  which  external  knowledge  is  shared,  mediated,  ne¬ 
gotiated  and  synthesized.  It  is  this  aspect  of  our  approach  to 
UM  that  allows  the  intermediary  to  become  a  proxy  for  the 
user  thereby  permitting  the  proxy  to  arbitrate  whether  infor¬ 
mation  is  assessed  as  relevant  or  not  relevant  (which  allowed 
H5  assessors — at  the  direction  and  guidance  of  the  proxy — 
to  generate  nearly  8000  assessments  for  training  data;  see 
§4 for further discussion). Alternative approaches to UM
could require the user to make all 8000 assessments to serve
as training data. However, the time constraints of TREC’s
Interactive Task make such an approach infeasible if not
impossible.

1 We follow [5] in taking text to be an information-bearing
object. Corpus is to be taken as any collection of texts, that
is, any collection of information-bearing objects.

2 The relationship of the intermediary to the user and the IR
system is one of systems boundaries. Buckland and Plaunt
[8] write that “systems boundaries define what is considered
the ‘system’ and what is considered the ‘environment’”. On
this definition, whether or not the intermediary is within the
system is determined by how integrated the intermediary is
into the design of the overall system.

3 For more on co-construction of knowledge and mutual
understanding, see [7] and [15].

Figure 2: Representation of User Modeling. The
portion within the dashed lines is internal to the
system.

For  the  Interactive  Task,  UM  comprised  four  component 
areas: (i) use case, (ii) scope, (iii) nuance and (iv) linguistic
variability.  The  resultant  representation  is  a  description  of 
subject  matter,  that,  if  found  in  a  document,  would  make 
that  document  relevant  (henceforth  Subject  Matter  Model). 


3.2.1  Use  Case 

Use case discussions allowed us to take into account the
user’s objectives: to produce to opposing counsel a set of
documents deemed responsive to the Request for Production
(RFP) [primary objective] and to mitigate the risk of
being accused of under-producing (i.e. intentionally
withholding responsive documents) or over-producing (i.e.
intentionally delivering non-responsive documents) [secondary
objective]. The decision to prioritize one risk over the other
has far-reaching design implications: under-production >
over-production implies a narrow, more exclusive conception of
relevance whereas under-production < over-production implies
a broad, more inclusive conception of relevance. During
UM, we learned the user felt that the risk of under-production
accusations outweighed the risk of over-production
accusations. Thus, when entering into scope, nuance
and linguistic variability discussions, we tested where and
how the user’s risk-mitigation considerations might manifest.


3.2.2  Scope 

We  define  scope  as  the  breadth  of  concepts  considered  rel¬ 
evant  by  the  user.  When  engaging  in  scope  discussions,  we 
seek to define the boundaries of relevance for a given conceptual
domain. For example, when engaging with the user for
RFP 103 (see footnote 4) we sought to understand how the user interpreted
retail  marketing  campaigns.  We  analyzed  the  phrase,  creat¬ 
ing  questions  that  tested  the  scope  of  each  word:  types  of 
retail  outlets,  the  activities  that  constitute  marketing,  and 
the  characteristics  of  a  campaign.  Based  on  these  questions, 
we  provided  the  user  examples  to  assess,  discussing  the  rami¬ 
fications  and  logical  extensions  of  her  responses.  We  iterated 
the  process  until  a  shared  definition  was  agreed  to. 

3.2.3  Nuance 

Nuance  refers  to  the  degree  of  specificity  required  to  be  rel¬ 
evant.  In  the  context  of  the  TREC  Interactive  Task,  dis¬ 
cussions  of  nuance  and  specificity  centered  on  the  semantic 
relations  hyponymy  and  hypernymy5.  For  instance,  it  was 
agreed that a hyponym of campaign, such as Marlboro
Ranch  (a  name  of  a  specific  marketing  campaign)  should  be 
considered,  in  and  of  itself,  a  marker  of  relevance,  whereas 
the  non-specific  hypernym  campaign  should  not  be  consid¬ 
ered,  in  and  of  itself,  a  marker  of  relevance. 
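
The agreed rule can be sketched as a lookup over a concept hierarchy. The hierarchy fragment and function names below are hypothetical and serve only to show how a specific hyponym can count as a marker of relevance while its hypernym does not.

```python
# Hypothetical fragment of a concept hierarchy: term -> its hypernym.
HYPERNYM = {
    "marlboro ranch": "campaign",
    "campaign": "marketing activity",
}

# Terms agreed to be too general to signal relevance on their own.
NON_SPECIFIC = {"campaign", "marketing activity"}

def is_hyponym_of(term, ancestor):
    """Walk up the hierarchy to test whether `ancestor` subsumes `term`."""
    current = HYPERNYM.get(term)
    while current is not None:
        if current == ancestor:
            return True
        current = HYPERNYM.get(current)
    return False

def is_relevance_marker(term):
    """A term marks relevance by itself only if it names a specific campaign."""
    term = term.lower()
    return term not in NON_SPECIFIC and is_hyponym_of(term, "campaign")
```

Under these assumptions, `is_relevance_marker("Marlboro Ranch")` holds while `is_relevance_marker("campaign")` does not.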

3.2.4  Linguistic  Variability 

Linguistic  variability  is  related  to,  but  distinct  from,  nu¬ 
ance.  We  define  linguistic  variability  as  the  variety  of  ways 
a  concept  can  be  expressed,  whether  lexically  or  syntacti¬ 
cally.  During  UM,  linguistic  variability  was  discussed  in  the 
context  of  cigarette  brands,  activities  that  constitute  retail 
marketing,  advertising  slogans,  etc.  Two  approaches  were 
evaluated:  defining  each  concept  as  a  closed  set  or  defining 
each  concept  in  terms  of  pertinent  characteristics.  It  was 
determined  that  the  user’s  use  case  (see  §3.2.1)  favored  the 
latter  over  the  former  and  thus,  definition-by-characteristic 
was  built  into  the  representation. 
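
The contrast between the two approaches can be illustrated with a small sketch. The term list and the regular expressions are invented for illustration and are not drawn from the actual Subject Matter Model.

```python
import re

# Approach 1: closed set. Only the listed expressions count.
CLOSED_SET = {"counter display", "shelf talker", "in-store poster"}

def matches_closed_set(text):
    lowered = text.lower()
    return any(term in lowered for term in CLOSED_SET)

# Approach 2: definition by characteristic. Any text that pairs a
# promotional activity with a retail location counts, whatever the wording.
PROMOTION = re.compile(r"\b(display|poster|sign|promotion|giveaway|coupon)s?\b")
RETAIL_SITE = re.compile(r"\b(store|counter|checkout|retailer|outlet)s?\b")

def matches_characteristics(text):
    lowered = text.lower()
    return bool(PROMOTION.search(lowered)) and bool(RETAIL_SITE.search(lowered))

novel = "Coupons will be handed out at the checkout of participating outlets."
closed_hit = matches_closed_set(novel)
characteristic_hit = matches_characteristics(novel)
```

The closed set misses the unanticipated wording in `novel`, while the characteristic-based definition captures it, which is the more inclusive behavior.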

3.3  Modification 

Belkin  [4]  notes  that  “a  change  in  one’s  state  of  knowledge, 
by  virtue  of  having  engaged  with  text,  will  be  reflected  in 
some  change  in  the  anomalous  state  of  knowledge”.  Because 
our  approach  to  UM  assumes  anomalous  states  of  knowledge 
on  the  part  of  both  the  user  and  proxy,  we  built  into  the  UM 
process  a  “check-in”  procedure  to  occur  during  week  7  of  the 
task:  we  supplied  the  user  with  16  documents,  each  cho¬ 
sen  to  test  whether  the  proxy’s  interpretation  of  relevance 
aligned  with  the  user’s  for  various  aspects  of  the  Subject 
Matter Model (SMM). The user’s assessment matched the
proxy’s for 14 of the 16 documents. Of the two discrepancies,
one was resolved as H5-internal assessor error; the other
triggered a modification to the SMM.
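
A check-in of this kind reduces to comparing two sets of assessments and listing the documents that need discussion. In the sketch below the document identifiers and labels are invented; only the counts (16 documents, 14 agreements) follow the figures reported above.

```python
def compare_assessments(proxy, user):
    """Return the agreement rate and the documents on which proxy and user differ."""
    shared = sorted(set(proxy) & set(user))
    discrepancies = [d for d in shared if proxy[d] != user[d]]
    agreement = (len(shared) - len(discrepancies)) / len(shared)
    return agreement, discrepancies

# Sixteen hypothetical check-in documents with invented labels.
proxy = {f"doc{i:02d}": ("R" if i % 2 == 0 else "NR") for i in range(16)}
user = dict(proxy)
user["doc03"] = "R"    # stands in for the assessor-error case
user["doc10"] = "NR"   # stands in for the case that changed the SMM

agreement, discrepancies = compare_assessments(proxy, user)
```

Here `agreement` is 14/16 = 0.875, and each entry in `discrepancies` is a candidate either for correction or for a change to the model.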

In subsequent discussions concerning this discrepancy, two
documents were discussed: dug65f00 and ccq45f00. The
user suggested the documents differed only in degree of
specificity as it pertained to the promotion of cigarette brands via
media outlets. Guidance was provided to modify the SMM
in order to allow for a broader interpretation of relevance
for the portion of the SMM under review. We modified the
SMM, which necessitated a course correction for our system
(see §6 for further discussion).

4 RFP 103: “All documents which describe, refer to, report
on, or mention any “in-store”, “on-counter”, “point of sale”,
or other retail marketing campaign for cigarettes.”

5 Hyponymy is the semantic relation in which the extension
of a word is subsumed in the extension of another word
(e.g. dachshund is a hyponym of dog). Hypernymy is the
semantic relation in which the extension of a word subsumes
the extension of another word (e.g. dog is the hypernym of
dachshund).

4.  DOCUMENT  ASSESSMENT 

The representation and quality of training data are a primary,
if not the primary, determiner of the success of supervised learning
[13].  The  presence  of  much  irrelevant  or  unreliable  data 
can  significantly  reduce  the  ability  of  a  learner  to  generalize 
or,  at  best,  increase  the  amount  of  training  data  needed  to 
generalize  properly.6  In  this  section,  we  describe  the  process 
we  used  to  generate  training  data. 

4.1  Goals 

The  motivation  for  the  process  described  here  is  to  (i)  gen¬ 
erate  a  large  amount  of  training  data  (ii)  of  the  appropriate 
kind  (iii)  with  minimal  error.  Assessed  documents  and  asso¬ 
ciated  annotations  form  the  primary  input  to  classification. 
As described in the introduction, the amount of information
contained  in  these  artifacts  is  determinative  of  a  high  quality 
result. 

It  is  generally  accepted  at  this  time  that  increased  amounts 
of  training  data  result  in  improved  classification  accuracy 
[9].  During  participation  in  the  task,  we  assessed  over  8000 
documents.  While  this  represents  a  large  number  of  assess¬ 
ments,  it  represents  less  than  1%  of  the  population,  and 
by itself cannot ensure that the topic is properly represented.
Additional  mechanisms  are  therefore  employed  to  actively 
determine  likely  sources  of  additional  relevant  documents 
with  distinct  language. 

As is usual, we distinguish two sources of error: random
error and systematic error. Random errors are the less serious
of the two types of error, and the more easily handled
by ordinary error checking mechanisms, such as
double-assessment. Random errors also have less serious
consequences for the classification task, and can be dealt with by
increasing the number of assessments [2]. Systematic errors,
on  the  other  hand,  pose  a  much  more  serious  challenge,  par¬ 
ticularly  for  a  task  with  a  very  well-defined  target,  such  as 
the  interactive  task.  If  systematic  error  is  allowed  to  in¬ 
filtrate  the  assessments,  the  resulting  system  could  become 
very  highly  targeted  on  a  topic  other  than  that  co-defined 
with  the  user.  Although  simple  consensus  mechanisms  can¬ 
not  combat  systematic  error,  we  discuss  additional  proper¬ 
ties  of  our  error  correction  procedures  that  are  deployed  to 
minimize  systematic  error.  Our  approach  to  minimizing  er¬ 
ror  is  critically  dependent  on  user  modeling. 
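
The difference between the two kinds of error can be illustrated with a small simulation. This is a toy model under stated assumptions: three assessors per document, a 10% independent flip rate for random error, and a shared misreading of one document type for systematic error. All names and rates are hypothetical.

```python
import random

def noisy_label(truth, error_rate, rng):
    """Random error: flip the true label independently with probability error_rate."""
    if rng.random() < error_rate:
        return "NR" if truth == "R" else "R"
    return truth

def biased_label(truth, doc_kind):
    """Systematic error: every assessor misreads 'borderline' documents the same way."""
    return "NR" if doc_kind == "borderline" else truth

def majority(labels):
    return "R" if labels.count("R") > len(labels) / 2 else "NR"

rng = random.Random(42)
# 400 truly relevant documents, one in four of the 'borderline' kind.
docs = [("R", "borderline" if i % 4 == 0 else "clear") for i in range(400)]

random_votes = [majority([noisy_label(t, 0.1, rng) for _ in range(3)]) for t, _ in docs]
biased_votes = [majority([biased_label(t, k) for _ in range(3)]) for t, k in docs]

random_error = sum(v != t for v, (t, _) in zip(random_votes, docs)) / len(docs)
biased_error = sum(v != t for v, (t, _) in zip(biased_votes, docs)) / len(docs)
```

Majority voting drives the random error rate well below the per-assessor rate, but leaves the systematic error untouched at one document in four, because all assessors agree on the wrong answer.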

In order to ensure consistency between assessors, a fraction
of assessed documents are independently assessed a second
time by another assessor. The resulting assessments are
compared, and disagreements are resolved. While this is
a fairly standard operating procedure in linguistic annotation
tasks, the additional constraints imposed by the interactive
task mean that non-standard mechanisms must be
employed to address the disagreements. If the requirement
of the assessment task were merely to ensure that consensus
had been achieved among assessors, then it would suffice
to resolve disagreements at the level of the assessors
themselves, perhaps by majority vote, or by asking assessors to
resolve their differences in order to come to an agreement.
Such methods, however, while they are able to address random
error such as might occur through an oversight on the
part of an assessor who then might be persuaded to overturn
his or her mistake, cannot ensure that systematic errors do
not overwhelm the true intent of the topic. Bringing mismatches
to the attention of the proxy, who was instrumental
in the co-construction of the theory of relevance, ensures that
systematic errors are resolved in favor of proper topic
interpretation.

6 The problem of training data quality has also been investigated
within the framework of computational learning theory,
where it has been shown that while it is possible to
learn in the presence of random noise [11], learning is not in
general possible with malicious errors [12]. In any case, the
amount of training data required to learn in the presence of
noise is increased [2].

4.2  Assessment  Guide 

The  work  of  assessors  is  informed  by  the  theory  of  relevance 
that  the  proxy  has  co-determined  with  the  user.  In  order  to 
communicate  this  intent,  and  to  give  added  guidance  to  as¬ 
sessors  in  specific  cases,  assessment  guidelines  are  drawn  up 
by  the  proxy  and  communicated  to  and  among  the  assessors. 
It  has  been  shown,  by  e.g.  [14],  that  annotator  agreement 
can  be  enhanced  by  increasing  amounts  of  detail  in  an  anno¬ 
tation  guide.  The  purpose  of  the  assessment  guide,  then,  is 
to  provide  detailed  direction  to  assessors  beyond  that  shared 
between  the  user  and  the  proxy.  To  be  sure,  the  guidance 
provided  by  the  proxy  is  grounded  in  his  or  her  understand¬ 
ing  of  the  theory  as  shared  with  the  user.  The  assessment 
guide,  however,  provides  additional  direction  to  the  assessors 
on  how  to  handle  known  and  anticipated  specific  instances 
of  the  topic.  The  assessment  guide  is  also  maintained  as  a 
continuous  record  of  decisions  made  about  particular  cases 
and  the  reasoning  behind  those  decisions. 

4.3  Assessment  Process 

The  assessment  process  we  use  is  designed  to  address  the 
above  goals  while  providing  a  straightforward  and  efficient 
workflow.  The  process  consists  of  an  initial  assessment  per¬ 
formed  on  all  documents  of  interest,  and  subsequent  error 
correction  steps,  performed  on  samples  of  the  population 
with  specific  characteristics.  Although  shown  as  unitary, 
the  process  actually  takes  place  over  time,  and  provides  for 
evolution  of  interpretation  as  new  exemplars  are  sought  and 
identified. 

4.3.1  Initial  Assessment 

Assessors  review  documents  drawn  randomly  using  internal 
sampling  procedures.  Documents  are  assessed  for  relevance 
(R)  or  non-relevance  (NR). 

4.3.2  Relevant  Passage  Identification 

Following initial assessment, a portion of the documents that
have been assessed as R undergo a second round of assessment
to identify relevant passages in the document. Relevant
passages form one of the inputs of the classifier, where
they serve to narrow the focus to highly relevant portions of
potentially very long documents.

To extract relevant passages, assessors re-read R-assessed
documents, and attempt to identify portions of the text that
serve as indicators of relevance.

In  addition  to  generating  additional  training  information, 
passage  extraction  serves  the  secondary  purpose  of  validat¬ 
ing  the  initial  assessment  of  relevant  documents.  Documents 
for  which  no  relevant  passage  can  be  found  are  flagged  for 
review  by  the  proxy.  Upon  review,  the  proxy  may  either 
indicate  the  relevant  passage,  leaving  the  document  as  R, 
or  overturn  the  R  assessment  in  light  of  the  lack  of  passage 
evidence,  changing  the  document  assessment  to  NR. 
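
The validation role of passage extraction can be sketched as a triage step followed by a proxy decision. The data structures and function names are assumptions made for illustration.

```python
def triage_passages(r_documents, passages):
    """Split R-assessed documents by whether a second assessor found a passage.

    r_documents: iterable of doc ids initially assessed R
    passages: dict doc_id -> list of extracted passage strings
    """
    confirmed, flagged = [], []
    for doc_id in r_documents:
        if passages.get(doc_id):
            confirmed.append(doc_id)
        else:
            flagged.append(doc_id)  # no passage found: send to the proxy
    return confirmed, flagged

def apply_proxy_decision(assessments, doc_id, proxy_passage):
    """The proxy either supplies a passage (document stays R) or overturns to NR."""
    assessments[doc_id] = "R" if proxy_passage else "NR"
    return assessments

assessments = {"a1": "R", "a2": "R", "a3": "R"}
passages = {"a1": ["point of sale display"], "a2": []}
confirmed, flagged = triage_passages(assessments, passages)
apply_proxy_decision(assessments, "a2", None)
apply_proxy_decision(assessments, "a3", "on-counter promotion")
```

In this toy run, `a2` is overturned to NR for lack of passage evidence, while `a3` remains R because the proxy identifies a passage that the assessor missed.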

Although  logically  related  to  assessment,  passage  extraction 
is  performed  separately  by  an  assessor  other  than  the  one 
who  provided  the  initial  assessment.  This  is  done  to  ensure 
that  passage  extraction  fulfills  its  function  as  a  part  of  qual¬ 
ity  control,  insofar  as  a  portion  of  the  relevant  documents 
are  assessed  independently  by  more  than  one  assessor. 

4.3.3  Cross  Check 

Like R documents, documents with an initial assessment
of NR must be quality checked via an independent second
assessment. However, unlike R documents, no relevant passages
can be expected in NR documents, and there is little
marginal benefit to entertaining a distinct process. Therefore,
a portion of NR documents are re-reviewed by a second
assessor. Disagreements between the initial and second review
are identified and flagged for review by the proxy. Upon
review, the proxy may choose to leave the document as NR,
or may overturn the initial assessment and make the
document R.
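
A minimal sketch of the cross check follows, assuming that the re-reviewed portion is a simple random sample of the NR documents. The sampling fraction, the `second_review` callback (standing in for the second assessor) and the function name are hypothetical.

```python
import random

def cross_check(assessments, second_review, fraction, seed=0):
    """Re-review a random portion of NR documents; return those to flag for the proxy.

    assessments: dict doc_id -> 'R' or 'NR' (initial assessments)
    second_review: callable doc_id -> 'R' or 'NR' (stands in for a second assessor)
    fraction: share of NR documents to re-review, between 0 and 1
    """
    nr_docs = sorted(d for d, lab in assessments.items() if lab == "NR")
    if not nr_docs:
        return []
    k = max(1, round(fraction * len(nr_docs)))
    sample = random.Random(seed).sample(nr_docs, k)
    # A disagreement is any sampled document that the second assessor calls R.
    return [d for d in sample if second_review(d) != "NR"]

initial = {f"d{i:02d}": "NR" for i in range(20)}
second = lambda doc_id: "R" if doc_id == "d07" else "NR"
flagged_all = cross_check(initial, second, fraction=1.0)
flagged_some = cross_check(initial, second, fraction=0.25)
```

With `fraction=1.0` every NR document is re-reviewed and `d07` is always flagged; with a smaller fraction it is flagged only if it happens to fall in the sample.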

4.3.4  Other  Quality  Controls 

In  addition  to  the  Relevant  Passage  Identification  and  Cross 
Check  procedures  described  above,  which  have  been  explic¬ 
itly  designed  for  quality  control,  improperly  assessed  docu¬ 
ments  are  sometimes  detected  in  other  parts  of  the  system. 
Although  these  ad-hoc  controls  individually  contribute  to 
only  a  small  degree,  taken  together  they  form  a  third  branch 
of  quality  control. 

Because  assessors  differ  in  their  capabilities,  level  of  exper¬ 
tise  and  knowledge  of  the  topic,  additional  quality  control 
measures  are  employed  on  a  per-assessor  basis.  The  proxy 
therefore  randomly  selects  documents  that  have  been  re¬ 
viewed  by  each  assessor  for  spot-checking  until  the  proxy 
is  confident  of  the  assessor’s  abilities. 
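
Per-assessor spot-checking can be sketched as a sequential procedure. The stopping rule used here, a fixed run of consecutive agreements, is an assumed stand-in for the proxy's own judgment of confidence, and the names are hypothetical.

```python
import random

def spot_check(assessor_docs, proxy_agrees, required_streak=10, seed=0):
    """Check randomly ordered documents until enough consecutive agreements occur.

    assessor_docs: doc ids reviewed by one assessor
    proxy_agrees: callable doc_id -> bool (does the proxy agree with the assessor?)
    Returns (confident, number_of_documents_checked).
    """
    order = list(assessor_docs)
    random.Random(seed).shuffle(order)
    streak = 0
    for checked, doc_id in enumerate(order, start=1):
        streak = streak + 1 if proxy_agrees(doc_id) else 0
        if streak >= required_streak:
            return True, checked
    return False, len(order)

docs = [f"doc{i}" for i in range(50)]
reliable = spot_check(docs, lambda d: True)
unreliable = spot_check(docs, lambda d: False)
```

A reliable assessor is cleared after the minimum number of checks, while an unreliable one keeps drawing the proxy's attention until the documents run out.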

5.  MEASUREMENT 

Iterative approaches to information retrieval, such as relevance
feedback, clearly offer benefits over a one-shot approach.
Additional retrieval iterations provide the opportunity
to uncover additional relevant documents or to refine
judgments on previously identified documents, and can
therefore potentially boost Recall, Precision, or both.
However, in order to attain any advantage over a single-shot
system, the iterative system must incorporate additional
knowledge during the iteration process. Blindly incorporating
additional information, with no attention paid to the
current state of the system or the likely effect of such
knowledge, is nevertheless a blunt instrument that neither
offers insight into the progress of the retrieval process nor
provides direction concerning those next steps which may be
most effective.


The alternative paradigm, which we espouse, incorporates
explicit measurement of the system at different stages of
processing. While measurement entails a certain amount of
effort, the benefits are great. Among the primary benefits
of measurement is the insight it provides to establish the
current state of the system and the degree to which it has
attained desired outcomes. While the goal of systems in
the TREC task is to establish the relative effectiveness of
different approaches to Information Retrieval, in real-world
applications it is often possible to set minimum standards
which will ensure that the information needs of the user are
being met subject to other constraints. Measurement, therefore,
determines not only the current state of the system, but
also how many iterations must be performed in
order to achieve the desired outcome.

In addition to providing insight into an iterative process,
measurement also informs decisions made during execution
of the process and provides the direction that is necessary
to make considered changes in the approach. Thus, for example,
if precision is seen to be low, additional effort can
be expended to more carefully refine training assessments
to reduce erroneous R assessments. If, on the other hand, recall
is low, additional efforts can be expended to find and
assess additional relevant documents. Beyond the ordinary
decisions regularly taken during exercise of a task, measurement
can also be brought to bear to deal with extraordinary
circumstances, such as the topic reinterpretation discussed
in §3.3 and §6.

An important component of measurement is yield, the estimated
number of relevant documents in the population.
Calculation of yield is essential to establish a target for the
review process and to determine progress toward that target.
Yield is calculated by drawing a random sample of the
entire population and assessing it according to the current
interpretation of relevance. As with all aspects of relevance,
however, yield is dependent on a correct interpretation of
relevance, which can and does change as user modeling progresses.
Yield measurements, therefore, must be interpreted
with the understanding that they may change in the future,
and should be repeated as relevance changes.
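
As a rough illustration of the yield calculation described above, the following sketch scales the relevant proportion observed in a random sample up to the whole population. The normal-approximation interval, the `z` value, and all names are assumptions made for the example; the text does not state which estimator or sample sizes were actually used.

```python
import math


def estimate_yield(sample_labels, population_size, z=1.96):
    """Estimate the number of relevant documents from a random sample.

    sample_labels: list of "R"/"NR" assessments for the sampled documents
    population_size: total number of documents in the population
    z: critical value for the interval (1.96 is an assumed 95% level)
    """
    n = len(sample_labels)
    p = sum(1 for label in sample_labels if label == "R") / n
    # Normal-approximation (Wald) interval on the proportion; an assumption here.
    half_width = z * math.sqrt(p * (1 - p) / n)
    low, high = max(0.0, p - half_width), min(1.0, p + half_width)
    return {
        "proportion": p,
        "yield": p * population_size,
        "interval": (low * population_size, high * population_size),
    }
```

Because the labels depend on the current interpretation of relevance, the sample would have to be reassessed, and the estimate recomputed, whenever that interpretation changes.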

6.  CASE  STUDY 

We present in this section an example of User Modeling
requiring modification to the co-determined theory of relevance
and subsequent corrections made to the training data.

6.1  Course  correction 

As mentioned in §3.3, designed into UM was a “check-in”
procedure to occur during week 7 of the task. The check-in
was implemented as a mechanism by which the proxy
could evaluate interpretation discrepancies that might have
arisen between the user and proxy, in recognition that interaction
with external knowledge sources (such as the corpus)
impacts knowledge states and thus might necessitate updating
the co-defined theory of relevance (cf. [7]). During
the check-in, such a discrepancy was discovered: the user
presented an alternate interpretation of relevance concerning
the degree of specificity required for a determination of
relevance for discussions of cigarette brand promotions via
media outlets. Prior to the check-in, a discussion of promoting
a cigarette brand through a media outlet required that a
specific brand and a specific media outlet be discussed for an
assessment of relevant to be valid (e.g., a marketing budget
indicating an advertisement for Lucky Strike being placed in
Newsweek). Generality in either domain did not meet the
definition of relevance (cf. Table 1).


                     Specific Media   Non-specific Media
Specific Brand       R                NR
Non-specific Brand   NR               NR

Table 1: Initial Definition of Relevance - Promotions
and Media

For example, cugl2d00 (Figure 3) contains a discussion of
promoting KOOL cigarettes in various media outlets such as
True Story, TV Guide, and Us. Because the document contains
a discussion of promoting a specific brand via specific
media outlets, the document was assessed as relevant.


The user’s alternate interpretation allowed for non-specificity
in one domain but not both (cf. Table 2).


                     Specific Media   Non-specific Media
Specific Brand       R                R
Non-specific Brand   R                NR

Table 2: Final Definition of Relevance - Promotions
and Media

Based on this change in interpretation, the definition of relevance
was modified (as were the SMM and attendant materials
such as the Assessment Guide). ais35e00 (Figure 4)
and cyol8e00 (Figure 5) are examples of previously NR-assessed
documents becoming R-assessed documents due to
the change in interpretation.

ais35e00 contains a discussion of promoting a specific brand
MARLBO (Marlboro) in a non-specific media outlet MAGAZIhC
(magazine). On the initial interpretation, the specificity of
the brand was not sufficient to overcome the generality of
the media outlet to trigger an R assessment. On the revised
interpretation, specificity of the brand was sufficient to trigger
an assessment of R even with a general media outlet.
The same held true for cyol8e00, in which running an advertisement
for Marlboro (specific brand) in a newspaper
(non-specific media outlet) was discussed.
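
The two definitions of relevance amount to a change from a conjunction to a disjunction over the two specificity conditions. The following minimal sketch encodes that reading; the function names and boolean encoding are illustrative assumptions and not part of the assessment materials.

```python
def initial_relevance(specific_brand: bool, specific_media: bool) -> str:
    """Initial definition: both the brand and the media outlet must be specific."""
    return "R" if specific_brand and specific_media else "NR"


def final_relevance(specific_brand: bool, specific_media: bool) -> str:
    """Final definition: non-specificity is allowed in one domain but not both."""
    return "R" if specific_brand or specific_media else "NR"


def changed_cells():
    """List the (specific_brand, specific_media) cells whose assessment changed."""
    return [
        (brand, media)
        for brand in (True, False)
        for media in (True, False)
        if initial_relevance(brand, media) != final_relevance(brand, media)
    ]
```

Under this encoding, ais35e00 and cyol8e00 both fall in the cell with a specific brand and a non-specific media outlet, one of the two cells that moves from NR to R.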

The interpretation modification discussed above resulted in an
increase in overall yield, since documents which contain
discussions of placing promotional material of specific brands
in non-specific media outlets, like those found in ais35e00 and
cyol8e00, constitute a fair number of the documents changed
from NR to R (for further discussion of yield, see §7).

7.  RESULTS 

In the 2008 TREC Legal Track Interactive Task, we explored
the application of the process described above on
Topic 103. We iterated, nine times, the User Modeling process
described in §3, accumulating 490 minutes of interaction
with the user. The vast majority of that interaction
(380 minutes; 77.55% of total time) occurred in weeks two
through four in order to establish an initial definition of relevance
prior to starting work on the topic. The remaining
Figure 3: cugl2d00: Initial Assessment (R) unchanged during course correction

Figure 4: ais35e00: Assessment changed from NR to R during course correction

The agency is to submit a layout for a 12M shipping case that incorporates a
back-to-back replica of the package on the face. A sample of the Marlboro soft-pack
case will be sent to the agency Monday.

Figure 5: cyol8e00: Assessment changed from NR to R during course correction
                Recall  95% CI          Precision  95% CI          F1     95% CI
T103 Final      0.624   (0.579,0.668)   0.810      (0.795,0.824)   0.705  (0.676,0.734)
T103 Internal   0.687   (0.645,0.730)   0.823      (0.787,0.860)   0.749  (0.709,0.790)

Table 3: Final Results
Figure 6: Time spent with Topic Authority


          # Relevant   % Relevant
Week 2    693,693      10.03874
Week 3    744,515      10.77420
Week 4    666,828      9.64996
Week 5    636,587      9.21233
Week 6    673,329      9.74403
Week 7    720,810      10.43115
Week 8    729,099      10.55110
Week 9    729,099      10.55110
Final     787,762      11.4

Table 4: Estimated Yield over Time


Figure  7:  Document  Assessments 


110 minutes (22.45% of total time) were taken by the check-in
in week seven, described in §6, above. User Modeling time
is shown in Figure 6.

During our participation in the Interactive task, assessors
viewed 7992 documents to provide training data for classification.
Following the description in §4, Figure 7 provides
a breakdown of the different assessment flows that documents
took, breaking results out by number of relevant (R)
and non-relevant (NR) documents. The end result was that
7992 documents were available for training: 1180 R, and
6812 NR for a training set yield of 14.76%.
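
The percentages quoted in this section follow directly from the raw counts. The short check below recomputes them; the helper name `share` is introduced only for illustration.

```python
def share(part, total):
    """Return part/total as a percentage rounded to two decimal places."""
    return round(100 * part / total, 2)


# User Modeling time: 490 minutes in total, split between early weeks and the check-in.
early_share = share(380, 490)      # weeks two through four
check_in_share = share(110, 490)   # week seven check-in

# Training set: 1180 R and 6812 NR assessments.
training_total = 1180 + 6812
training_yield = share(1180, training_total)
```

The training set yield is higher than the population yield estimates because the training documents were not drawn as a uniform random sample of the population.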

Over the running of the task, measurements were conducted
at regular intervals, as described in §5. Yield measurements
(estimated number of relevant documents) are shown in Table 4.
Note that while the relevant proportion varied only from 9.2%
to 10.7%, this range reflects a substantial number of documents
in a population of this size. After initial uncertainty during early user
modeling (weeks 1-2), yield settles on a downward trend,
reaching a low of 9.2% in week 5, due to an increasingly strict
definition of relevance. Following the check-in with the Topic
Authority described in §6, however, relevance expanded, and
this is reflected in the measurements, with yield eventually
rising to 10.5%.

Table  3  shows  final,  post-adjudicated  results  reported  by 
TREC,  as  well  as  final  internal  estimates.  Although  the 
internal  estimates  are  slightly  higher  than  the  final  TREC 
results,  the  difference  is  well  within  the  confidence  interval. 
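
As a consistency check on the reported figures, F1 can be recomputed as the harmonic mean of precision and recall. The sketch below assumes that the values 0.624 and 0.687 in Table 3 are recall for the final and internal results respectively; that reading is an inference from the table layout rather than something stated in the text.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values taken from Table 3; recall is an assumed reading of the first column.
final_f1 = round(f1(precision=0.810, recall=0.624), 3)
internal_f1 = round(f1(precision=0.823, recall=0.687), 3)
```

Both recomputed values agree with the F1 figures in the table to three decimal places, which supports reading the first column as recall.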

8.  CONCLUSIONS 

We have presented a novel approach to addressing the task of
large-scale information retrieval in an interactive task where
relevance is defined primarily by the judgments of a single
individual. The triad of User Modeling, Document Assessment
and Measurement combines to provide a shared understanding
of relevance, a means for representing that understanding
to an automated system, and a mechanism for
iterating and correcting such a system so as to converge on
a desired result.

The problem of how external notions of relevance are converted
into a computerized representation is deserving of
further research, with consequences not only for the Legal
community but for all areas of human endeavor with massive,
comprehensive Information Retrieval problems.